Inference on the eigenvalues of the covariance 
matrix of a multivariate normal 
distribution-geometrical view- 

Yo Sheena* 
September 2012 

Abstract 

We consider an inference on the eigenvalues of the covariance ma- 
trix of a multivariate normal distribution. The family of multivari- 
ate normal distributions with a fixed mean is seen as a Riemannian 
manifold with Fisher information metric. Two submanifolds naturally 
arises; one is the submanifold given by fixed eigenvectors of the co- 
variance matrix, the other is the one given by fixed eigenvalues. We 
analyze the geometrical structures of these manifolds such as metric, 
embedding curvature under e-connection or m-connection. Based on 
these results, we study 1) the bias of the sample eigenvalues, 2)the 
information loss caused by neglecting the sample eigenvectors, 3)new 
estimators that are naturally derived from the geometrical view. 
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1 Introduction 

Consider a normal distribution with zero mean and an unknown covariance 
matrix, N(0, £). Let denote the eigenvalues of £ by 

A = (Ai, . . . , A p ), Ai > . . . > A p 
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and eigenvectors matrix by -T, hence we have the spectral decomposition 

27 = r\r\ A = diag(A), (1) 

where diag(A) means the diagonal matrix with the ith diagonal element A«. It 
is needless to say that inference on 27 is an important task in many practical 
situations in such a diversity of fields as engineering, biology, chemistry, 
finance, psychology etc. Especially we often encounter the cases where the 
property of interest depends on 27 only through its eigenvalues A. We treat 
an inference problem on the eigenvalues in this paper. 

Geometrically viewed, the family of normal distributions N(0, 27) is taken 
as a manifold (say S) with a single coordinate system 27. Hence, S is iden- 
tified with the space of symmetric positive definite matrices. Geometrically 
analyzing the space of symmetric positive definite matrices is an interesting 
topic in a mathematical or engineering point of view. For example, see Smith 
[22], Fletcher and Joshi, [10], Lenglet et. al. [16]. From a statistical point of 
view, it could give us some new insights for the inference on 27. In this paper, 
we analyze S from the standpoint of information geometry while focusing on 
the inference on the eigenvalues of 27. The two kinds of submanifolds play 
an important role, that is, a submanifold given by fixed eigenvectors of the 
covariance matrix and the one given by fixed eigenvalues. 

Based on independent n samples x.i = (xn, . . . , Xi P )', i = 1, . . . , n from 
this distribution, we want to make inference on the unknown A. It is well- 
known that the product-sum matrix 

n 

S = Xix\ 

1=1 

is sufficient statistic for both unknown A and i~\ The spectrum decomposi- 
tion of S is given by 

S = HLH\ L = diag(Z), 

where 

I = (Zi, . . . , l p ), li > . . . > l p > a.e. 

are the eigenvalues of S, and H is the corresponding eigenvectors matrix. 
This decomposition gives us two statistics available, i.e. the sample eigenval- 
ues I and the sample eigenvectors H. However it is almost customary that we 
only use the sample eigenvalues, discarding the information contained in H. 
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In the past literature on the inference for the population eigenvalues, every 
notable estimator is based simply on the sample eigenvalues. Here we just 
mention the literature in the statistical field that deals with the estimation of 
A under the situation p < n; Stein [23], Takemura [24], Dey and Srinivasan 
[7], Haff [11], Yang and Berger [25] for orthogonally invariant estimators of 
S\ Dey [6], Hydorn and Muirhead [12], Jin [13], Sheena and Takemura [20] 
for a direct estimator of A. 

In a sense it is natural to implicitly associate the sample eigenvalues to the 
population eigenvalues, and the sample eigenvectors to the population coun- 
terpart. However the sample eigenvalues is not sufficient for the unknown 
population eigenvalues. Therefore it is important to evaluate how much in- 
formation is lost by neglecting the sample eigenvectors. Following Amari [1] , 
we gain an understanding of the asymptotic information loss with geometric 
terms such as Fisher information metric and embedding curvatures. 

Another statistically interesting topic is the bias of I. It is well known that 

1 is largely biased and the estimators mentioned above are all modification 
of I to correct the bias, that is, "shrinkage estimators." We show that the 
bias is closely related to the embedding curvatures. Moreover the geometric 
structure of S naturally leads us to new estimators, which are also shrinkage 
estimators. 

We briefly introduce the order of the contents in this paper. In Section 

2 and 3, we treat respectively Fisher information metrics and embedding 
curvatures, and present their specific forms related to the eigenvalues infer- 
ence. In Section 3, we refer to the bias of I in relation to the curvatures. In 
Section 4, we get the explicit form of the asymptotic information loss caused 
by discarding the sample eigenvectors in the first and second orders w.r.t. 
the sample size. In the final section, we propose the new estimators which 
is naturally derived in a geometrical view. All the proofs of the propositions 
are collected in Appendix. 

For geometrical concepts used in this paper, refer to Amari [2], Amari 
and Nagaoka [3]. 

2 Riemannian Manifold and Metric 

The density of the normal distribution N(0, U) is given by 
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If we let <7jj and a lj denote the element of respectively U and £ 1 , 
then the log likelihood equals 



log /!;(*) = J2 X * (- a "/2) + {-<J ij ) ~ (p/2) log27r - (1/2) log |27| 

i i<j 

= 5>i0" + $> y 6> y - ^( @ ) ( sa y 'G/; ©)), 



(2) 



where 6 = (9 lJ )i<j and y = {yij)i<j are given by 



< 



( 0* = (-l/2)<r ii , 



i = l,...,p, 

l<i <j <P, 
i = l,...,p, 
1 < % < j < p, 



(3) 



and 



^(6) = (p/2)log27r + (l/2)log|27(e)|. (4) 

The summations in the equation (2) is an abbreviation for Y^=i an d Xa<i<j< P > 
and we will use these kinds of notations hereafter without mention. 

The expression (2) gives natural coordinate system of the manifold S as 
an full exponential family. Another coordinate system, so called expectation 
parameters, are also useful, which are defined as; 



1 < % < j ' < p. 



(5) 



For the analysis of the information carried by I and H, we need to pre- 
pare another coordinate system. The matrix exponential expression of an 
orthogonal matrix O is given by 



O = expt/ = J p + U + \u 2 + + 



(6) 



where I p is the p-dimensional unit matrix, U is a skew-symmetric matrix 
and parametrized by u = (wij)i<i<j< p as 

Uij, if 1 < i < j < p, 
(U)ij = { -u i:j , if 1 < j < i < p, 
0, if 1 < % — j ' < p. 



The function expU is cliff eomorphic, and u gives "normal coordinate" for 
the group of orthogonal matrices. We can use this coordinate as local sys- 
tem around I p and construct an atlas for the entire space of p-dimensional 
orthogonal matrices (note this space is compact); for each r, there exists an 
open neighborhood and some open ball B in Rp^p- 1 )/ 2 around the origin such 
that these spaces are diffeomorphic by the function r expU(u) on B. 

We will use (A, u) as the third coordinate system of S and call it "spectral 
coordinate (system)". Notice that this coordinate system is associated with 
the following submanifolds in S. If we fix r in (1), then we get a submanifold 
A4(r) embedded in S with a coordinate system A. This is a subfamily in 
N(0, S) and called curved exponential family. Its log-likelihood is expressed, 
as we emphasize it as a function of A, to be 

l(y; 0(A)) = $>^(A) • »,^(A) - ^(6(A)). (7) 

i i<j 

On the contrary, if we fix A in (1), we get another submanifold A(X) in S, 
whose coordinate system is given by u in a neighborhood of each point of 
A(X). Its log-likelihood expression is given by 

J(l/;6(u)) = 5>«0*(u) + X> -^'(«) - (8) 

i i<j 

First we consider a metric, that is, a field of symmetric, positive definite, 
bilinear form on S. The statistically most natural metric is Fisher informa- 
tion metric. Suppose {f(x; 6} is a parametric family of probability density 
functions, whose coordinate as a manifold is given by 6 = (9i, . . . , 9 P ). Then 
the component of Fisher information metric with respect to 6 is given 
by 

d , d 



Ea 



For the multivariate normal distribution family, N(fx, S) (/x, the mean pa- 
rameter is also included), Skovgaard [21] gives a clear form of Fisher infor- 
mation metric. The tangent vector space at a fixed point U w.r.t. (o"ij)j<j 
coordinate can be identified with the space of symmetric matrices. For any 
symmetric matrix A, B, the metric with respect to the £ = (a^) coordinate 
system is given by 

^{S-^AS-'B). (9) 
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We are interested in Fisher information metric with respect to the spec- 
tral coordinate (A, it). Let d a , db, ■■■ denote the tangent vectors w.r.t. 
the A coordinate, d( s ,t), d( U;V ),--- denote the tangent vectors w.r.t. the it 
coordinate. Namely 

a a d A d 

On = T7^, C( S)t ) = 



d\ n ' ^ du 



St 



These tangent vectors (exactly speaking, vector fields) are invariant with 
respect to the orthogonal transformation of U; For some orthogonal matrix 
O, the orthogonal transformation Fof S is defined as 

F(S) = OSO 1 (10) 

For any 0, 

F*(d a ) = d a , l<a<p, (11) 

F m (d M ) = d M , l<s<t<p. (12) 

Proposition 1 Let ( , ) denote Fisher information metric based on x ~ 
N(0,S), then the components of the metric with respect to (A, it) is given 
as follows; 

9ab(x) = (d a ,d b ) = (1/2)A~ 2 8(a = b) l<a,b< p, 

9a( s ,t)(x) = (d a , d (Stt) ) = I <a<p, I < s <t <p, 

9(s,t)(u,v)(x) = (d( St t),d( UtV )) 

= (X s - A t ) 2 Aj 1 A7 1 6((s,t) = (u,v)) 1 < s < t < p, 1 < u < v < p. 
S(-) equals one if the logic inside the parenthesis is correct, otherwise zero. 

There are two remarkable properties of the metric for the spectral coordi- 
nate. First since the metric components matrix is diagonal, (d a , • • • , <9( s ,t), • • • ) 
is an orthogonal coordinate system, especially that the submanifolds A4(r) 
and A(X) are orthogonal to each other for any A and r. Second it is inde- 
pendent of r, hence the metric is invariant with respect to the orthogonal 
transformation F in (10) for any orthogonal matrix O. (Second property is 
instantly derived from the expression (9).) 

The fundamental structure of S is determined by an afline connection, 
and it gives birth to curvature, geodesies, etc. As it is seen in Amari [1], [2], 
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Amari and Nagaoka [3], e-connection and m-connection is important for the 
analysis of statistical exponential family, but we don't make further reference 
here. All results related to Levi-Civita connection for S can be found in 
Skovgaard [21]. For a-connection for the multivariate normal family with 
zero mean, refer to Eguchi [9] . A family of multivariate normal distributions 
as a Riemannian manifold is studied by several authors. Refer to Fletcher 
and Joshi [10], Lenglet et. al. [16] and Lovric et. al. [17]. 

We just mention the fact that S is e-flat and m-flat, and corresponding 
affine coordinations are given respectively by (cr lJ ') and 



3 Embedding Curvature 

For the analysis of the asymptotic distribution (l,H), it is important to 
observe the embedding curvatures of M. and A. (See Amari and Kumon 
[4], Kumon and Amari [14].) Especially, asymptotic information loss can be 
explained in view of these curvatures. Embedding curvature shows how the 
submanifold M or A is placed in the manifold S; We consider the following 
embedding curvatures; 

1. Embedding curvature of M, with respect to e-connection or m-connection. 

e ^ e to ^ m 

Hab(s,t)= (Vd,Pb , d(s,t)), Hab(s,t)= (Va<A, , <9( s ,t)), (13) 

e 

where Vdflb is the covariant derivative of <9& in the direction of d a with respect 

m 

to e-connection. Vdflb is similarly defined. 

2. Embedding curvature of A with respect to m-connection 

to A m 

H( s ,t)(u,v)a= (Vd (s , t) d{u,v) , d a ), (14) 

rn 

where Vd (s t p{u,v) is the covariant derivative of <9( Sji ) in the direction of d( U:V ) 
with respect to m-connection. 

On these curvatures at the point (A, J"), we have the following results. 

Proposition 2 For 1 < a,b < p, 1 < s < t < p, 

em 

Hab(s,t)=Hab(s,t)= 0. (15) 
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For 1 < a < p, 1 < s < t < p, 1 < u < v < p, 

{A~ 2 (A t - A a ), if s = u = a, t = v, 
A~ 2 (A S - A a ), if s = u, t = v = a, (16) 
0, otherwise. 

Another expression of the embedding curvature of A is given by 

H( s ,t)(u,v)= H( s ,t)(u,v)b 9 a , 

b 

With this notation, the orthogonal projection of the covariant derivative 

m 

onto the tangent space of M. is given by 

2^ H ts,t){u,v) d a - 
a 

From Proposition 1, 2, we have 

m 

H( s ,t)(u,v)= 2(Ai - \ a )5(s = u = a, t = v) + 2(A S - \ a )5(s = u, t = v = a), 
hence 

STm f> j^t-K)d s + 2(X s -X t )d t , if (s,t) = (u,v), 
2-. H ^ d « = { , otherwise. 

An embedding curvature has full information about the "extrinsic curva- 
ture" of the embedded submanifold in any direction. Sometimes it is con- 
venient to compress it into a scalar measure of the curvature. "Statistical 
curvature" by Efron (see Efron [8], Murray and Rice [19]) is such a measure; 
For A, it is defined by (see pl59 of Amari [2]) 

7(^4) = E E #(.,«)(«,„)« H(o, p) ( q ,r)b g (s > t)M g (u ' v)M g a \ 

l<a,b<p s<t,u<v,o<p,q<r 

which attains the following value at the point (A, r). 



8 



Corollary 1 

a<b 

From these results, we notice that if S is endowed with m-connection, 
then 1) the embedding curvatures and the statistical curvatures of A are 
independent of r, 2) any one-parameter curve (A, r(u)) given by the pa- 
rameter ii( s ,t), s < t, while A and the other elements of u are fixed, is curved 
in the direction of d t — d s and contained in a two-dimensional plane composed 
by <9( Sj t) and d t — d s , 3) the statistical curvature of A could be quite large 
when A are close to each other, while Ai is flat everywhere. 

Here we introduce another submanifold A which is contrasting to A in 
the sense that A is flat with respect to m-connection. For a point (A, J 1 ), let 

A(\, r)±{zes\ (r'sr^i = 1 < Vi < P }. 

We easily notice that A is the minimum distance points with respect to 
Kullback-Leibler divergence. That is, 

A(X, r) = {EeS\ argmin^ KL(S, rdiag(Ai, . . . , A p )r*) = A}, 

where KL(S, S) is the Kullback-Leibler divergence between N(0, S) and 
N(0, £), which is specifically given by 



tr(M -1 ) -log|27^ _1 | 



p. 



The minimum distance points with respect to the Kullback-Leibler distance 
consists of all the points on the m-geodesics which pass through the point 
(A, r) and are orthogonal to A4(r) at that point. (See Theorem in A2 of 
Amari [1]). 

We can visualize the structure of S endowed with m-connection for the 
two dimensional case. See Figure 1, where A4i = A^(i^), i = 1,...,3, 
Ai = A(\i), i — 1,2 and A\ = A{\\, T\) are drawn. When p = 2, Ai is a 
two-dimensional autoparallel submanifold with the affine coordinate (Ai, A2), 
while A is a one-dimensional submanifold with an coordinate W(i,2)- As it is 
seen in Proposition 1, all the tangent vectors <9i(= <9 2 (= ^), <9(i i2 )(— 
oJ* - ) are orthogonal to each other. A is a "straight" line which is also 
orthogonal to M. The arrow on M. is the line {A|Ai + A 2 is constant}, and 
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Figure 1: M, A and A 



the arrow head indicates the direction in which c = A 2 /Ai increases. The 
statistical curvature turns out to be the increasing function of c ; 

1 + c 2 

We can analyze the bias of li = n~ l U, i — 1, . . . , n from the geometrical 
structure of S. It is well known that E[li] (i = 1, . . . ,p) majorizes Aj (i = 
1, . . . , p), that is, 

X>[^]>X>' l<Vj<p-l, f>[ r «] = X>- (18) 

i=l i=l i=l i=l 

The bias E[li] is quite large when n is small and Aj's are close to each other 
(see Lawley [15], Anderson [5]). For the case p = 2, 

£[Zi]>Ai, E[l 2 }<\ 2 . (19) 
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Figure 2: A and A on c-coordinate 



Suppose a sample S = n~ 1 S takes the value at a point s G 5. Let Si denote 
the point on A4(r) designated by the eigenvalues of S, namely I = (/i,/ 2 ). 
The curve A(l) connects s and s\. If we define s 2 as the point on M.(r) 
designated by A = (Ai, A 2 ) = ((r*5r)n, (r*Sr) 22 ), then „4(A, P) connects 
s and s 2 . The three points s, si and s 2 are on the same plane, and if we move 
from Si in the direction to s 2 , then the statistical curvature of A increases 
(see Figure 2). If we estimate (Ai, A 2 ) = (<7n, <t 22 ) by I, then the estimand is 
the point si, while for the unbiased estimator A, the estimand is the point s 2 . 
Since the c-coordinate of si is always smaller than that of s 2 , the estimator 
(^1, ^2) is likely to estimate Ai and A 2 too apart, which causes the bias (19). 
It is also seen that the bias gets larger when c approaches to one, that is, Ai 
and A 2 get closer to each other. 

4 Information Loss 

Now we consider the information loss caused by ignoring H for the inference 
on A. Information loss matrix (Ag ab (l)), 1 < a, b < p at a fixed point 
£ = (A, r) is given by 

Ag ab (l) ^ E[g ab (S\l)} = g ab (S) - g ab (l), 

where g a b(S) , g ab (l) , g a b(S\l) are the components of the metrics w.r.t. d a and 
d b based on respectively the distributions S, I and the conditional distribu- 
tion of S given I, all of which are measured at the point 27 = (\,r). 
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Amari [1] found that the asymptotic information loss can be expressed in 
terms of the metric and the embedding curvatures; 



Ag ab (l) = n ^2 9a{s,t)9b { u,v)9 

s<t,u<v 

Ee e ^ 

Hac(s,t) Hbd(u,v) 9 9 

c,d,s<t,u<v 

Emm , 
H (s,t)(u,v)a H(o,p){q,r)b g ( "'"' y "" J ' g" 



(s,t)(u,v) 

la\s,t)yO(U,v)y 

s<t,u<v 

cd (s,t)(u,v) 

J--iac(s,t) ±±oa(u,v) y 

c,d,s<t,u<v 

(s,t)(o,p) n (u,v)(q,r) 

l-L (s,t)(u,v)a -L-l (o,p)(q,r )b y 
s<t,u<v,o<p,q<r 

+ 0(n- 1 ). (20) 



Proposition 3 

Ag ab (l) = B ab + 0(n- 1 ), 

where 

B a b at the point (A, r) depends only on A. When the information loss 
of a statistic has the order 0(n~ q+1 ), we call the statistic is the gth order 
sufficient. Consequently the statistic I is the first order sufficient, but not 
the second order sufficient. B^, the information loss in the first order term 
(0(1)) is negligible when the eigenvalues are separated to some extent. On 
the contrary, it is quite large when the population eigenvalues are close to 
each other. Note that the information carried by I is given by the formula; 

9ab(l) = 9ab(S) - Ag ab (l) 
= ng ah {x) - Ag ab (l) 
= (n/2)\f5(a = b)-Ag ab (l). 

Since (g a b(l)) is positive definite, Ag ab (l) must be bounded in the neigh- 
borhood of a point where X± — ■ ■ ■ — X p . This indicates that the term of 
order 0{n~ l ) in B ab is also unbounded in such a neighborhood. Hence the 
expansion of the information loss with respect to n is not useful when the 
population eigenvalues are close to each other. 
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Figure 3: A and A on c-coordinate 



In order to check the information loss numerically, we carried out a simple 
simulation. Consider the null hypothesis 

H : 27 = I p against alternative Hi : U ^ I p . (21) 

The ordinary likelihood ratio test based on S (say the first test) is given by 
the rejection region 

(e/n) pn/2 exp(-(l/2)trS)|S| n/2 < a a , (22) 

where a a is chosen so that the size of the test is a. See e.g. Theorem 8.4.2 
of Muirhead [18]. On the other hand, the likelihood ratio test based on the 
distribution of I (say the second test ) is given by 

exp(-(l/2)E<*0 < b 

su Pa ECU A7 n/2 f 0(p) exp(-(l/2)trMJf'A-i)^(Jf) 

13 



since the density function of I is given by 



p 



T\(k-lj) [ exp(-(l/2)trMJEf*27 

i<i J o(p) 



n-p-l)/2 



-1 



where dfi(H) is the uniform probability on the group of p-dimensional or- 
thogonal matrices, 0{p), and K is the normalizing constant (see Theorem 
3. 2. 18 of Muirhead [18]). 

Figure 3 shows the simulation result on the powers of the two tests for 
the case p — 2, n — 10. The alternative hypotheses are taken in various 
directions from the null point (Ai, A 2 ) = (1, 1); 



where 9 = tx /A— (j — l)7r/50, j — 1, ... ,51. (Note that the alternatives are not 
at the same distance from the null w.r.t. the metric of Proposition 1. )The 
powers for each alternative are the mean values from 10 5 times repetition. 
For the integral calculation in (23), we picked up 100 points from 0(2) in an 
equidistant manner. The graph shows that the second test performs as well 
as the first test, hence it indicates the information loss is not serious at least 
around the point (Ai, A 2 ) = (1, 1). 



5 A New Estimator Based on Geometrical 
Insight 



In this section, we apply the geometrical analysis in the previous sections to 
the estimation of A. We consider two cases where the population eigenvec- 
tors, r, is respectively known and unknown. 

5.1 Case where r is known 

First let's consider the case where r is known. In this case, statistics 
(i^ST)^, i = l,...,p is the second order sufficient, since A(X,T) is or- 
thogonal to M(r), and the m-embedding curvature of A(X,T) vanishes 
(see the equation (20) ). Therefore the estimator 



Hi : (Ai,A 2 ) = (I,l) + (l/v^)(cos0,sin0), 



U(Ai,..,Ap)=(( 



r t sr) ll ,...,{r t sr) pp ) 



(24) 
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Figure 4: Risks of 27^ and IJ^ as A changes 



is superior to the ordinary estimator I = (n~ l li, . . . , n _1 / p ) in view of the 
asymptotic information loss. Especially consider the case where we know 
that r — I p , that is, each variate is independent. Then the estimator \ 
becomes the ordinary sample variance for the ith variate. The lesson is 
that if each variate is independent, then we should tackle with each variate 
separately, don't treat them as a multivariate set. 

We made a numerical evaluation of the superiority of A to I. We used 
Kullback-Leibler divergence as a loss function (say KL-loss). Namely for the 
true parameter U = diag(Ai, . . . , X p ) and the estimators 

£W = diag([ 1 , ...,/;), = diag(A 1 , . . . , A p ), (25) 

we evaluate the loss of the estimator as 

KL{E^,E)=\x{E ( ^S- 1 )-]og\S ( ^E- 1 \-p, i = 1,2. (26) 
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Figure 5: Risks of and as r changes 



We compared the risks E[KL{IJ^\ £)], i — 1, 2 for the case p — 2, ra = 10 
under the condition (Ai, A 2 ) = (1, c), where c varies from one to 0.02 by the 
increment of 0.02. Note that both estimators and the loss function (26) are 
all scale invariant. Figure 4 shows the result. We repeated 10 5 times risk 
evaluation for each c and took the average. Theoretically the risk of is 
constant. The perturbation of the dotted line is due to the simulation error. 
We observe that the risk of A is reduced by nearly 30 % compared to I in the 
neighborhood of A = (1, 1), where the second order term of the information 
loss of I is maximized. 

In a practical situation, it is rare to know the exact population eigen- 
vectors beforehand. It is more likely that we only have a vague knowledge 
about the eigenvectors. For example, we sometimes have a prior knowledge 
that each variate is almost uncorrelated to each other although they are not 
perfectly independent. In this situation, which is better to use A = (Sa)i<i< p 
or I ? The following simulation suggest that the superiority of A to I could 
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hold good in a relatively large neighborhood of M(I P ). Figure 5 shows the 
change of the risks of the two estimators w.r.t. K-L loss under the condition 
A = (1, 0.8), n = 10 as T varies with 9 (0 < 6 < tt/2) as 

_ / cos 9 — sin 9\ 
ysin9 cos9 J 

For each r, the repetition was done 10 5 times and we took the average as the 
risks. IJ^ has a constant risk, though the risk line for IJ^ is perturbed by 
the simulation error. The graph shows that IJ^ still performs better than 
in spite of the situation r ^ J 2 . 



5.2 Case where r is unknown 

As we saw in the previous subsection, if we have information about r, the 
estimator A works well. If we do not have information about r beforehand, 
a naive idea is to estimate it by the sample eigenvectors matrix H and 
substitute it into A. However this only reproduces I. 

H is a point estimator (m.l.e.) of r, while we could construct a distribu- 
tion of r based on the sample S. If we take the expectation of (r t Sr) ii (i = 
1, . . . , p) with this distribution, it could produce a reasonable estimator. First 
consider the conditional distribution of H when I is given. Since S = HLH 1 
is distributed as Wishart matrix W p (n,U), its density w.r.t. the uniform 
probability d/i(H) on the group of p-dimensional orthogonal matrices 0(p) 
equals 

f(H\l;S) = K(l ; S)' 1 exp(-(l/2)trJf LiT'IT 1 ) , (27) 
where normalizing constant K{1 ; U) is given by 

K{l-S)= [ exp(-(l/2)trHLH t i;- 1 )dfi(H). 

JO(p) v ' 

This conditional distribution depends on 27. If we substitute £ with the 
estimator S, a density of r on 0{p) with respect to d/i(r) is given by 

f(r\l;S) = K(/;5)- 1 exp(-(l/2)trrxr*5- 1 ), (28) 

where 

K(l;S)= [ exp(-(l/2)tirLr t S~ 1 )dfi(r). 

Jo(p) v 1 
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Take the expectation of (Z^ST 1 ^ w.r.t. the density (28), 

a* = k(i ; sy 1 [ (r*5r)^ex P (-(i/2)trrxr*5- 1 )^(r), i = l,...,p. 

Jo(p) v J 

(29) 

Let L denote diag(Z). Because of the invariance of dfi, it turns out that 
A* = K^iy 1 [ (r t HLH t r) ii exp(-(l/2)trLr t HL- 1 H t r)dfi(r) 

Jo(p) v ' 

= Kiiy 1 [ (r t Lr) ii exp(-(n/2)tiLr t L~ 1 r)dfi(r) (30) 

JO(p) v ' 

where 

K(l) 4 I exp(-(n/2)trLr t L- 1 r)dfi(r). (31) 
JO(p) v y 

We propose A* = (A*, . . . , A*) as a new estimator of A. It is easily proved 

that A* — > li, % — 1, . . . ,p, as n — > oo, and A* is a "shrinkage" estimator. 
The analytic evaluation of this estimator's performance seems difficult even 
for the large sample case. Instead we show the numerical result comparing 
I and A . We simulated the risks of both estimators w.r.t. K-L loss for the 
case p = 2. Since they are functions of I and scale invariant, it is enough to 
measure the risks for £ = diag(l,c), < c < 1. We varied c from 0.04 to 
1.00 by the increment 0.04, and for each c we repeated the risk evaluation 
10 4 times and took the average. For the integral calculation of (30) and (31), 
we picked up 50 points from 0(2) in an equidistant manner. Figure 6 shows 
the result. It seems that the new estimator performs better compared to I, 
especially A are close to each other. 



6 Appendix 

6.1 Proof of Proposition 1 

As a base for the vector space of real symmetric matrices, we consider 
Eij (1 < i < j < p) which is a p x p matrix defined by 

[Iij + Iji, if /' < ./. 
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Figure 6: Risks of I and A* as c changes 

where Jy (1 < i,j < p) is the p x p matrix whose element equals one, 
and all the other elements are zero. The one to one correspondence 



d^^^—^Eij, l<i<j<p, 

(JO; 



gives the component expression of (9) 



( 9 «J) >9 (*,0) = hr^E^S-'E^, 1 < i < j < p, 1 < k < I < p. 



Since 
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we have the following relations 

^<«> = &*-'(E^ME <*> 
«^.) = ^'(Ef>MEi^")}. <»> 

where l<a,fr<£>, l<s<t<£>, 1 < w < t> < p. 

For the first order derivative at u = 0, we only have to consider U up to 
the term to the first power w.r.t. u, hence we put it) as 

u) = r(i p + u)A(i p + c/)*r* + o(||^|| 2 ) 
= rAr' + rKu'r 1 + ru\r l + o(||u|| 2 ). (37) 

Therefore we have 

where uu = (1 < i < p), Uij = —Uji (1 < j < i < p), which leads to 



d\ a 



lioTijai 



u=0 



and 



da. 



du 



St 



\litljs - Klisljt + Klisljt - ^slitljs- 



u=0 



From (38) and (39), we have the following results on tangent vectors; 

i<3 ^ i<j 

where j a is the ath column of -T, and 



(38) 



(39) 



(40) 



da i4 



dir t Eii = A * 7 * 7 * ~ As7s7 * + A * 7s7 * ~ As7 * 7 



(41) 
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If we substitute (40) and (41) into (34), (35) and (36), we get the results as 
follows; 

2g ab (x) = tr (iTSaTa^V^) 

= tr{( 7 *i;- 1 7(l ){(7^-S)} 

= tr(A-M(a = b)X^5{a = 6)) 
= A a ~ 2 5(a = b), 

2g aM (x) = trjlTV)^ -1 (A t7t7 * - + A t7s7 * - A s7t7 *) } 

= X t X~ 2 5(a = s = t)- X s X~ 2 5(a = s = t) 
+X t X~ 2 5(a = s = t) - X s X~ 2 5(a = s = t) 

= 0, 

2g {sMu ,v){x) = tr{27 _1 (A t 7 t 7* - A s7s7 * + A t7s7t ' - A s7t7 *) 

x E' 1 (\ v j v jI - Xulull + A„ 7u7 * - A u7 „ 7 *) } 



= X t X v X^S(u = t)A;^(s = v)- XtXuX^Siv = t)X^5(s = u) 



+ X t X v Xt l 5(v = 




= u) - X t X u Xt l 5{u 


= ^Xj'Sis = 


v) 


— X s X v Xj 1 5(u = 


s)XT 1 5(t 


= v) + A s A M A; 1 5(s 


= v)X^5(t = 


u) 


- X s X v X^ 1 5(s = 


v)x; 1 5(t 


= u) + X s X u X~ l 5{u 


= s)X; 1 5(t = 


v) 


+ X t X v X^ 1 5(u = 


s)X; 1 5(t 


= v)- X t X u X~ l 5(v 


= s)X; 1 5(t = 


u) 


+ X t X v X~ 1 5(v = 


s)XT 1 5(t 


= u) - X t X u X~ l 5(u 


= s)X; 1 5(t = 


v) 


— A s A 1) A < T 1 5(m = 


0A;M( S 


= v) + XgXuX^div 


= t )x: 1 5(u = 


s) 


- X s X v Xt 1 5(t = 




= s) + A S A M A < T 1 5('U 


= VXj'Sis = 


v) 


-1 + AA; 1 - 1 + A s A t _1 + AA; 1 - 1 + A^A^ 1 - 


- l)S(s — u,t 


= v) 



= 2{X~ s \X t - A s ) + X^{X S - X t ))S(s = u,t = v) 
= 2(X t - A S )(A; 1 - A- 1 )5(s = u,t = v) 
= 2(X t - A s ) 2 (A s A t ) _1 5(s = u,t = v). 
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6.2 Proof of Proposition 2 

Note that U' 1 = TA^r*, hence 



0*3 = 



-a- 1 !^* 1 if 



This means M. is an affine subspace of S w.r.t. an 0, which is an affine 
coordinate system of S with e-connection. Consequently M. is e-flat, i.e. 

e m 

H a b(s,t)= 0. H a b(s,t)= is similarly proved. See Theorem 1.1 in Amari and 
Nagaoka [3]. 

m 

Now we consider H( s ,t)(u,v)a- Using (4.14) in Amari [2], it is calculated as 



TO 

H( s ,t)(u,v)a = 



E 


d 2 a ij 




du st du uv 


u=O dX a 


-2" 


du st du uv 

±<i,j <p 



u=0 



da ij 



u=0 dX a 



u=0 



= -2 _1 tr(AB), 
where p x p matrices A, B are given by 



(Ah 



d 2 a. 



du st du uv 



, (B) 



A ^ 



tt=0 



9A a 



, l<i,j<P- 



u=0 



In order to calculate A, we only have to consider U up to the terms powered 
by two w.r.t. u. 

s = r(i p + u + 2~ 1 t/ 2 ) a(i p + u + 2~ 1 u 2 yr t + o(\\u\\ 3 ) 
= r\r l + r(u\ + AE/')r* + 2~ 1 r(u 2 \ + A(c/ 2 )*)r* + ruAtfr 1 

+ 0(||«|| 3 ). 
Therefore is expressed as 

an = 2- 1 ^7 ifc7j7 ((C/ 2 A + A(C/ 2 )*) M + 2(C/AC/*)H)+^ + 0(|| W || 3 ), (42) 

k,i 
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where R = TAP* + r(UA + ACT*)/ 1 *. Since 



(U 2 A + A(U 2 Y) kl = {U 2 A) kl + (U 2 A) lk 

= y] u kh u u \i +^2u lb u bk \ k , 
b b 

2(UAU t ) k i = 2^u kb uibK 

b 

(42) truns out to be 
on = 2~ 1 ^2'j ik 'y j i(u kb u b i\i + ui b u bk \ k + 2u kb ui b \ b ) + R i:j + 0(||w|| 3 ). (43) 

k,l,b 

From this we have 

x2 = £(«g> + ««+«» + $+«« + ««>), (44) 
where 

a§ } = 7is7j4>^(5{(A;, b) = {s, t), (b, I) = (u, v), (s, t) ^ (u, v)} 
-litljv^vS{{k,b) = (t,s),(b,l) = (u,v),(s,t) ^ (u,v)} 
-lisljuK5{{k,b) = (s,t),(b,l) = (v,u),(s,t) ^ (u,v)} 
+ litljuKS{(k,b) = (t,s),(b,l) = (v,u),(s,t) ^ (u,v)} 
+ liuljA t S{(k, b) = (u, v), (b, I) = (s, t), (s, t) ^ (u, v)} 
-livljt\${{k,b) = (v,u), (b,l) = (s,t), (s,t) ^ (u,v)} 

- liuljsK&{{k, b) = (u, v), (b, I) = (t, s), (s, t) ^ (u, v)} 
+ %vi js X s 5{(k, b) = (v, u), (b, I) = (t, s), (s, t) ^ (u, v)}, 

a?) =2 lisljt \ t 5{(k,b) = (s,t),(b,l) = (s,t),(s,t) = (u,v)} 
+ 2j it j js X s S{(k,b) = (t,s),(b,l) = (t,s),(s,t) = (u,v)} 

- 2-f is j js \ s 5{(k, b) = (s, t), (b, I) = (t, s), (s, t) = (u, v)} 

- 2 lltljt \ t 5{(k, b) = (t, s), (b, I) = (s, t), (s, t) = («, v)} 
= -2-f is j js \ s 5{(k, b) = (s, t), (b, I) = (t, s), (s, t) = (u, v)} 

- 2%a jt X t 5{(k, b) = (t, s), (b, I) = (s, t), (s, t) = (u, v)}, 



d 2 a„ 



du st du, 



23 



%■ =2-f is -f ju \ t 5{(k,b) = (s,t),(l,b) = (u,v),(s,t) ^ (u,v)} 



- 2'j it 'j ju \ s 5{(k,b) = 


= (t, 


s),(l,b) = 


= (s,t) 


^(u 


l v ) 


- 2'j is 'j jv \t5{(k,b) = 


= (s, 


t),(l,b) = 


= (s,t) 


^{u 


, v ) 


+ 2% t -/ jv X s 5{(k,b) = 


= (*, 


*),(*,&) = 


~- (v,u), (s,t) 


^{u 


, v ) 


+ 2^ iu ^ js X t 8{(k,b) = 


= (u 


v),(l,b)~- 


= (s,t), (s,t) 


^{u 


, v ) 


- 27„7 i A<H(M) = 


= (v, 


«),(*,&) = 


= (s,t), (s,t) 


+ [u 


, v ) 


- 2-f iu >y jt X s 5{(k,b) = 


= (u 


«),(/, 6) = 


= (t,s), (s,t) 


±{u 


l«) 


+ 2~f iv ~f jt \ s 5{(k,b) = 


-(v, 


«),(/,&) = 


= (t, s), (s,t) 







< j = A lisljs \ t 5{{k,b) = (l,b) = (s,t) = (u,v)} 
+ 4- fi a jt X s 5{{k,b) = (l,b) = (t, s) = (v,u)} 

- 4^ jt X t 5{(k, b) = (s, t), (I, b) = (t, s), (s, t) = (u, v)} 

- 4ry it j js X 8 6{(k,b) = (t,s),(l,b) = (s,t),(s,t) = (u,v)} 
= ^ is -f js X t S{(k,b) = (l,b) = (s,t) = (u,v)} 

+ 4- fi a jt X s 5{{k,b) = (l,b) = (t, s) = (v,u)}. 



2A = A (1) + (A (1) )* + A& + (A^Y + A^ + A< 4 \ (45) 



A {1) = lsl lXJ{t = u)+ jalXJis = v) 
+ lulWKs = v) + -y v ~flKS(u = t) 

- jalXJis = u,t^v)- 7 s 7*A u 5(t = v, s ^ u) 

- lvllhK u = s,t^v)- 7„7^A s 5(t = v ,s^u), 
A {2) = -2( 7s 7*A s 5(s = u ,t = v) + 7t7t V(« = u,t = v)), 
A^ = 2 (-fs-flXtSit = v,s^u)+ ~ra%S(s = u,t^v) 



Furthermore we have 



where 
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Since 



we have 



da ij 



-K 2 lialjo 



u=0 



B = -K'lall 



(46) 



From (45) and (46), we have 



H {s , t )(u,v)a = -4- 1 tr(2AB) 

= -4- 1 tr{(A( 1 ) + (A (1) )' + A< 2 > + (A^Y + A (3) + A^)B] 

= 4- 1 A- 2 tr{(A( 1 ) + (A (1) )' + A< 2 > + + A (3) + A< 4 >) 7a 7<}. 

The following equalities hold; 

tr(A (1) 7 a 7*) = X a 5(s — v — a,t — u) + X a 5(t = u = a, s = v) 
+ X a 5(t = u = a, s = v) + X a 5(s — v — a,t — u) 

— X a S(t — v — a, s — u,t ^ v) — X a S(s — u — a,t — v, s ^ u) 

— X a 5(t = v = a, s = u,t 7^ v) — X a 5(s = u = a,t = v, s ^ u) 
= 0. 

0. 

— 2(X a S(s — u — a,t — v) + X a S(s — u,t — v — a)). 
—2(X a S(s — u — a,t — v) + X a 5(s — u,t — v — a)). 
2{X t S(s = u = a)S(t — v, s ^ u) + X s 5{t = v = a)S(s — u,t ^ v) 
+ X t 8(s = u = a)5(t = v , s ^ u) + X s 5{t = v = a)5(s — u,t ^ v)} 

— 2{X s 5(t = u = 0)6(3 — v) + X t 5(s = v = a)5(t = u) 
+ X t S(s = v = a)5(t = u) + X s 5(u = t = a)5(s = v)} 

= 0. 

tr(A (4) 7 a 7*) = AX t 5(s — u — a,t — v) + AX s 5(t = v = a, s = u). 



tr((A (1) )'7 a 7*) 
tr(A< 2 > 7a7 *) 

tr((A( 2 ))' 7 a7*) 
tr(A( 3 ) 7 a7a) 
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Consequently 

m 

H( s ,t){u,v)a = ~K S(s = u = a,t = v)-X a 5(s = u,t = v = a) 

+ X~ 2 X t 5(s — u — a,t — v) + X~ 2 X s 5(t = v = a, s = u) 

A~ 2 (A t - A a ), if s = u = a, t = v , 
K 2 (^s - X a ), if s = u, t = v = a, 
0, otherwise. 

6.3 Proof of Corollary 1 

As we will see in the next subsection, 

n (s,t)(u,v)a n (o,p)(q,r)b U i/ 

s<t,u<v,o<p,q<r 

-, if a 7^ b. 



(X a - A a ) 2 

Combine this with Proposition 1, we have 



2 £ 



6.4 Proof of Proposition 3 

The term of the order n in (20) vanishes since g a i s ,t) equals zero for 1 < a < 

e 

Pi 1 < s < t < p. We consider the term of order 0(1). Since H ac (s,t) also 
vanishes for 1 < a, c < p, 1 < s < t < p, we only have to consider the term 

(1/2) Yl H {s ,t)(u,v)aH {0 , p){a , r)b gMio*) 

s<t,u<v,o<p,q<r 
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Because of (16), the above term equals 

2~ 2^ H(a,t)(a,t)aH(b,p)(b,p)b (O^' 



t>a,p>b 

m 



+ 2- 1 J2 H {a , t){a , t)a H (pMp , b) b(g {aMp ' b) ) 2 

t>a,p<b 

+ 2" 1 £ ^(t,a)(*,a)^(M(M & (5 ( *' aKM ) 2 
t<a,p>fe 

+ 2- 1 £ i(, a )(,a)ai( P , fe )( P , b ) fe (^ ( *' a)(?5 ' 6) ) 2 (47) 
t<a,p<fe 

If a = 6, then (47) equals 

2- 1 E(i (a ,)M) a ) 2 (^- t)( ^) 2 + 2- 1 ^(^ (t , a)( , a)a ) 2 (^^)^) 2 

= 2-{5>-u - + E(a-(a, - ^(^) 2 } 



, o A2(A a - A t ) 2 



If a < 6, then (47) equals 

2~ H(a,b)(a,b)aH(a,b)(a,b)b {9 ) 

= 2 _1 A~ 2 (A 5 - A a )A^ 2 (A a - A 6 ) -A 6 ) 2 ) 



2(A a -A 6 ) 2 ' 
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